Abstract. This paper considers the problem of representative selection:
choosing a subset of data points from a dataset that best represents
its overall set of elements. This subset needs to inherently reflect the
type of information contained in the entire set, while minimizing
redundancy. For such purposes, clustering may seem like a natural approach.
However, existing clustering methods are not ideally suited for
representative selection, especially when dealing with non-metric data, where
only a pairwise similarity measure exists. In this paper we propose
δ-medoids, a novel approach that can be viewed as an extension to the
k-medoids algorithm and is specifically suited for sample representative
selection from non-metric data. We empirically validate δ-medoids in two
domains, namely music analysis and motion analysis. We also show some
theoretical bounds on the performance of δ-medoids and the hardness of
representative selection in general.


1 Introduction 

Consider the task of a teacher who is charged with introducing his class to 
a large corpus of songs (for instance, popular western music since 1950). In 
drawing up the syllabus, this teacher will need to select a relatively small set of 
songs to discuss with his students such that 1) every song in the larger corpus is 
represented by his selection (in the sense that it is relatively similar to one of the 
selected songs) and 2) the set of selected songs is small enough to cover in a single 
semester. This task is an instance of the representative selection problem. Similar 
challenges often arise in tasks related to data summarization and modeling.
Examples include finding a characteristic subset of Facebook profiles out of a
large set, or a subset of representative news articles from the entire set of
news gathered during a single day from many different sources.

On its surface, representative selection is quite similar to clustering, a more 
widely studied problem in unsupervised learning. Clustering is one of the most 
widespread tools for studying the structure of data. It has seen extensive usage 
in countless research disciplines. The objective of clustering is to partition a
given data set of samples into subsets so that samples within the same subset
are more similar to one another than samples belonging to different subsets.
Several surveys of clustering techniques can be found in the literature [20, 41].

The idea of reducing a full set to a smaller set of representatives has been
suggested before in specific contexts, such as clustering XML documents or
dataset editing, and more recently in visual and text summarization [28]. It
has also been discussed as a general problem in [39]. These recurring
notions can be formalized as follows. Given a large set of examples, we seek a
minimal subset that is rich enough to encapsulate the entire set, thus balancing
two competing criteria: keeping the representative set as small as possible,
while satisfying the constraint that all samples are within δ of some
representative. In the next subsections we define this problem in more exact
terms, and motivate the need for such an approach.

While certainly related, clustering and representative selection are not the 
same problem. A seemingly good cluster may not necessarily contain a natural 
single representative, and a seemingly good partitioning might not induce a good 
set of representatives. For this reason, traditional clustering techniques are not 
necessarily well suited for representative selection. We expand on this notion in 
the next sections. 


1.1 Representative Selection: Problem Definition 

Let S be a data set, d : S × S → ℝ⁺ be a distance measure (not necessarily
a metric), and δ be a distance threshold below which samples are considered
sufficiently similar. We are tasked with finding a representative subset C ⊆ S
that best encapsulates the data. We impose the following two requirements on
any algorithm for finding a representative subset:

- Requirement 1: The algorithm must return a subset C ⊆ S such that for
any sample x ∈ S, there exists a sample c ∈ C satisfying d(x, c) ≤ δ.

- Requirement 2: The algorithm cannot rely on a metric representation of
the samples in S.

To compare the quality of different subsets returned by different algorithms, we 
measure the quality of encapsulation by two criteria: 

- Criterion 1: |C|. We seek the smallest possible subset C that satisfies
Requirement 1.

- Criterion 2: We would also like the representative set to best fit the data
on average. Given representative subsets of equal size, we prefer the one that
minimizes the average distance of samples from their respective
representatives.
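The requirement and the two criteria above can be checked mechanically for any candidate set. The following is a minimal sketch, assuming the distance measure is available as a Python callable `d(x, c)`; the function names and the toy data are hypothetical and only serve as illustration.

```python
def dist_to_set(x, reps, d):
    """Distance from sample x to its closest representative."""
    return min(d(x, c) for c in reps)


def covers(samples, reps, d, delta):
    """Requirement 1: every sample lies within delta of some representative."""
    return all(dist_to_set(x, reps, d) <= delta for x in samples)


def average_distance(samples, reps, d):
    """Criterion 2: mean distance of samples from their closest representative."""
    return sum(dist_to_set(x, reps, d) for x in samples) / len(samples)


if __name__ == "__main__":
    # Toy data (an assumption): points on a line, absolute difference as d.
    points = [0.0, 0.5, 1.0, 4.0, 4.5]
    abs_diff = lambda a, b: abs(a - b)
    candidate = [0.5, 4.0]
    print(covers(points, candidate, abs_diff, delta=0.5))
    print(len(candidate), average_distance(points, candidate, abs_diff))
```

Criterion 1 is simply `len(reps)`; two candidate sets of equal size that both pass `covers` would then be ranked by `average_distance`.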

Criteria 1 and 2 are applied on a representative set solution. In addition, we 
expect the following desiderata for a representative selection algorithm. 





- Desideratum 1: We prefer representative selection algorithms which are
stable. Let C₁ and C₂ be different representative subsets for dataset S
obtained by two different runs of the same algorithm. Stability is defined as
the overlap between C₁ and C₂. The higher the expected overlap is, the more
stable the algorithm is. This desideratum ensures the representative set is
robust to randomization in data ordering or the choices made by the algorithm.

- Desideratum 2: We would like the algorithm to be efficient and to scale
well for large datasets.

Though not crucial for correctness, the first desideratum is useful for consistency 
and repeatability. We further motivate the reason for desideratum 1 in Appendix 
B, and show it is reasonably attainable. 
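As an illustration of how stability could be estimated in practice, the sketch below averages a pairwise overlap score over the representative sets returned by several runs. The Jaccard-style normalization used here is an assumption made for the example and is not necessarily the exact overlap definition used in the paper; the function names are hypothetical.

```python
def overlap(c1, c2):
    """Jaccard-style overlap |C1 & C2| / |C1 | C2| (an assumed normalization)."""
    a, b = set(c1), set(c2)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def expected_overlap(runs):
    """Average pairwise overlap across representative sets from several runs."""
    pairs = [(i, j) for i in range(len(runs)) for j in range(i + 1, len(runs))]
    return sum(overlap(runs[i], runs[j]) for i, j in pairs) / len(pairs)
```

A score close to 1 would indicate that different runs (for example, over shuffled data orderings) recover nearly the same representatives.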

The representative selection problem is similar to the ε-covering number problem
in metric spaces. The ε-covering number measures how many small spherical
balls would be needed to completely cover (with overlap) a given space. The
main difference is that in our case we also wish the representative set to closely
fit the data (Criterion 2). Criteria 1 and 2 are competing goals, as larger
representative sets allow for lower average distance. In this article we focus
primarily on Criterion 1, using Criterion 2 as a secondary evaluation criterion.

1.2 Testbed Applications 

Representative selection is useful in many contexts, particularly when the full
dataset is redundant (due to many near-identical samples) or when using
all samples is impractical. For instance, given a large document and a satisfactory
measure of similarity between sentences, text summarization [25] could be framed
as a representative selection task: obtain a subset of sentences that best captures
the nature of the document. Similarly, one could map this problem to extracting
“visual words” or representative frames from visual input. This paper
examines two concrete cases in which representatives are needed:

- Music analysis - the last decade has seen a rise in the computational
analysis of music databases for music information retrieval [5], recommender
systems [23] and computational musicology [5]. A problem of interest in
these contexts is to extract short representative musical segments that best
represent the overall character of the piece (or piece set). This procedure is
in many ways analogous to text summarization.

- Team strategy/behavior analysis - opponent modeling has been discussed
in several contexts, including game playing [3], real-time agent tracking
and general multiagent settings [3]. Given a large dataset of recorded
behaviors, one may benefit from reducing this large set into a smaller
collection of prototypes. In the results section, we consider this problem as a
second testbed domain.

What makes both these domains appropriate as testbeds is that they are
realistically rich and induce complex, non-metric distance relations between
samples.





The structure of this paper is as follows. In the following section we provide a 
more extensive context to the problem of representative selection and discuss why 
existing approaches may not be suitable. In Section 3 we introduce δ-medoids, an
algorithm specifically designed to tackle the problem as we formally defined it. 
In Section 4 we show some theoretical analysis of the suggested algorithm, and 
in Section 5 we show its empirical performance in the testbed domains described 
above. 

2 Background and Related Work 

There are several existing classes of algorithms that solve problems related to 
representative selection. This section reviews them and discusses the extent to 
which they are (or are not) applicable to our problem. 

2.1 Limitations of Traditional Clustering 

Given the prevalence of clustering algorithms, it is tempting to solve
representative selection by clustering the data and using cluster centers (if
they are in the set) as representatives, or the closest point to each center. In
some cases it may even seem sufficient, once the data is clustered, to select
samples chosen at random from each cluster as representatives. However, such an
approach usually only considers the average distance between representatives and
samples, and it is unlikely to yield good results with respect to any other
requirement, such as minimizing the worst case distance or maintaining the
smallest set possible. Moreover, the task of determining the desirable number of
clusters k for a sufficient representation can be a difficult challenge in itself.

Consider the example in Figure 1: given a set of |S| = 100 points, and a
distance measure (in this case, the Euclidean distance metric), we seek a set of
representatives that is within distance 1 of every point in the set, thus satisfying
Criterion 1 with δ = 1. Applying a standard clustering approach on this set, the
distance constraint is only consistently met when k > 77 (and rarely with fewer
than 70 samples). Intuitively, a large number of clusters is required to ensure
that no sample is farther than δ from a centroid. However, we can obtain the
same coverage goal with only 13 samples. Defining a distance criterion rather
than a desired number of clusters has a subtle but crucial impact on the problem
definition.


2.2 Clustering and Spatial Representation 

The above limitation of clustering applies even when the data can be embedded
as coordinates in some n-dimensional vector space. However, in many cases,
including our motivating domain of music analysis, the data does not naturally
fit in such a space. This constraint renders many common clustering techniques
inapplicable, including the canonical k-means [24], or more recent works such
as [39]. Furthermore, the distance function we construct (detecting both local
and global similarities) is not a metric, since it violates the triangle inequality.
Because of this property, methods reliant on a distance metric are also
inapplicable. Among such methods are neighbor-joining [32], which becomes
unreliable when applied on non-metric data, or the k-prototypes algorithm.¹

Nevertheless, certain clustering algorithms still apply, such as the k-medoids
algorithm, hierarchical clustering [35], and spectral clustering [35]. These
methods can be employed directly on a pairwise (symmetric) similarity matrix,²
while satisfying the triangle inequality is not a requirement.

Fig. 1. Clustering vs. representative selection. (a) When applying k-medoids (defined
in Section 2.3), k = 77 clusters are required to satisfy the distance condition. (b) A
better representative set does so with only 13 representatives.

2.3 The k-Medoids Algorithm

The k-medoids algorithm is a variation on the classic k-means algorithm that
only selects centers from the original dataset, and is applicable to data organized
as a pairwise distance matrix. The algorithm partitions a set of samples into a
predetermined number k of clusters based on the distance matrix. Similarly to the
k-means algorithm, it does so by starting with k random centers, partitioning the
data around them, and iteratively moving the k centers toward the medoids of each
cluster, where the medoid of a cluster S is defined as
$\mathrm{medoid}_S = \arg\min_{s \in S} \sum_{x \in S} d(x, s)$.
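The medoid definition translates almost directly into code. This is a minimal sketch, assuming the distance is given as a callable `d(x, s)` and the cluster as a non-empty list; the helper name `medoid` is chosen for illustration.

```python
def medoid(cluster, d):
    """Return the cluster member minimizing the summed distance from all members.

    Ties are broken in favor of the member that appears first in `cluster`.
    """
    return min(cluster, key=lambda s: sum(d(x, s) for x in cluster))


if __name__ == "__main__":
    abs_diff = lambda a, b: abs(a - b)
    print(medoid([0, 1, 2, 3, 10], abs_diff))
```

Unlike a centroid, the result is always an element of the cluster, which is why only pairwise distances are needed.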

All of the approaches mentioned so far are specifically designed for dividing
the data into a fixed number of partitions. In contrast, representative selection
defines a distance (or coverage) criterion δ, rather than a predetermined number
of clusters k. In that respect, k-medoids, or spectral and hierarchical clustering,
force us to search for a partition that satisfies this distance criterion. Applying a
clustering algorithm to representative selection requires an outer loop to search
for an appropriate k, a process which can be quite expensive.

We note that traditionally, both hierarchical methods and spectral clustering
require the full pairwise distance matrix. If the sample set S is large (the usual
use case for representative selection), computing a pairwise |S| × |S| distance
matrix can be prohibitively expensive. In the case of spectral clustering, an
efficient algorithm that does not compute the full distance matrix exists [31], but
it relies on a vector space representation of the data, rendering it inapplicable in
our case.³ The algorithm we introduce in this article does not require a distance
metric, nor does it rely on such a spatial embedding of the data, which makes it
useful even in cases where very little is known about the samples beyond some
proximity relation.

¹ In certain contexts, metric learning can be applied, but current methods are
not well suited for data without vector space representation, and in some sense,
learning a metric is of lesser interest for representative selection, as we care less
about classification or the structural relations latent in the data.

² For spectral clustering, the requirement is actually an affinity (or proximity) matrix.

Algorithm 1 Extended Greedy K-Centers Approach (farthest first traversal)
 1: Input: data sampleSet = x_0 ... x_m, required distance δ
 2: choose random starting representative x_i
 3: representativeSet = {x_i}
 4: sampleSet = x_0 ... x_{i-1}, x_{i+1} ... x_m
 5: maximalDist = max_{s ∈ sampleSet} d(s, representativeSet)
 6: while maximalDist > δ do
 7:   maximalElement = argmax_{s ∈ sampleSet} d(s, representativeSet)
 8:   representativeSet = representativeSet ∪ {maximalElement}
 9:   sampleSet = sampleSet \ {maximalElement}
10:   maximalDist = max_{s ∈ sampleSet} d(s, representativeSet)
11: end while


2.4 k-Centers Approach

A different, yet related, topic in clustering and graph theory is the k-centers
problem. Let the distance between a sample s and a set C be
$d(s, C) = \min_{c \in C} d(s, c)$. The k-centers problem is defined as follows:
given a set S and a number k, find a subset R ⊆ S, |R| = k, so that
$\max_{s \in S} d(s, R)$ is minimal [18].

In metric spaces, an efficient 2-approximation algorithm for this problem exists,
as follows.⁴ First choose a random representative. Then, k - 1 times, add the
element farthest away from the representative set R to R. This approach can
be directly extended to suit representative selection: instead of repeating the
addition step k - 1 times, we can continue adding elements to the representative
set until no sample is more than δ away from its closest representative (see
Algorithm 1).

³ In some cases the distance matrix can be made sparse via AD-trees and
nearest-neighbor approximations, which also require a metric embedding [5].

⁴ We note that no better approximation scheme is possible under standard
complexity-theoretic assumptions.






While this algorithm produces a legal representative set, it ignores criterion 2 
(average distance). 
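The extended farthest-first traversal of Algorithm 1 can be sketched as follows. This is an illustrative implementation rather than the authors' code: it assumes `d(x, x) = 0` for every sample (so that a newly added representative covers itself and the loop terminates), and it takes a hypothetical `start` index instead of drawing the first representative at random, to keep the example deterministic.

```python
def greedy_delta_centers(samples, d, delta, start=0):
    """Add the farthest sample as a representative until all are within delta."""
    reps = [samples[start]]
    # closest[i] caches the distance from samples[i] to the representative set.
    closest = [d(x, reps[0]) for x in samples]
    while max(closest) > delta:
        far = max(range(len(samples)), key=closest.__getitem__)
        reps.append(samples[far])
        # Only distances to the newly added representative need to be computed.
        closest = [min(c, d(x, samples[far])) for c, x in zip(closest, samples)]
    return reps


if __name__ == "__main__":
    abs_diff = lambda a, b: abs(a - b)
    print(greedy_delta_centers([0, 1, 2, 3, 10], abs_diff, delta=1.5))
```

Caching the closest-representative distance keeps each round linear in the number of samples; the selected representatives tend to sit at the extremes of the data, which is consistent with the remark that average distance is ignored.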

Another algorithm that is related to this problem is Gonzalez's approximation
algorithm for minimizing the maximal cluster diameter, which iteratively
takes out elements from existing clusters to generate new clusters based on the
inter-cluster distance. This algorithm is applicable in our setting since it too only
requires pairwise distances, and can produce a legal coverage by partitioning the
data into an increasing number of clusters until the maximal diameter is less
than δ (at which point any sample within a cluster covers it). This approach is
wasteful for the purpose of representative selection, since it forces a much larger
number of representatives than needed.

Lastly, in a recent, strongly related paper, the authors consider a similar
problem of selecting exemplars in data to speed up learning. They do not pose
hard constraints on the maximal distance between exemplars and samples, but
rather frame this task as an optimization problem, softly associating each sample
with a “likelihood to represent” any other sample, and trying to minimize the
aggregated coverage distance while also minimizing the norm of the representation
likelihood matrix. Though this approach is very interesting, it is hard to make
this method guarantee a desired distance bound, and the soft association of
representatives to samples is inadequate for our purposes.

3 The δ-Medoids Algorithm

In this section, we present the novel δ-medoids algorithm, specifically designed
to solve the representative selection problem. The algorithm does not assume a
metric or a spatial representation, but rather relies solely on the existence of some
(not necessarily symmetric) distance or dissimilarity measure d : S × S → ℝ⁺.
Similarly to the k-centers solution approach, the δ-medoids approach seeks to
directly find samples that sufficiently cover the full dataset. The algorithm does
so by iteratively scanning the dataset and adding representatives if they are
sufficiently different from the current set. As it scans, the algorithm associates
a cluster with each representative, comprising the samples it represents. Then,
the algorithm refines the selected list of representatives, in order to reduce the
average coverage distance. This procedure is repeated until the algorithm reaches
convergence. Thus, we address both minimality (Criterion 1) and average-distance
considerations (Criterion 2). We show in Section 5 that this algorithm achieves its
goal efficiently in two concrete problem domains, and does so directly, without
the need for optimizing a meta-parameter k.

We first introduce a simpler, single-iteration δ-representative selection algorithm
on which the full δ-medoids algorithm is based.


3.1 Straightforward δ-Representative Selection

Let us consider a more straightforward “one-shot” representative selection
algorithm that meets the δ-distance criterion. The algorithm sweeps through the
elements of S, and collects a new representative each time it observes a
sufficiently “new” element. Such an element needs to be more than δ away from
any previously collected representative. The pseudocode for this algorithm is
presented in Algorithm 2.

Algorithm 2 One-shot δ-representatives selection algorithm
 1: Input: data x_0 ... x_m, required distance δ
 2: Initialize representatives = ∅
 3: Initialize clusters = ∅
 4: representative assignment subroutine, RepAssign, lines 5-22:
 5: for i = 0 to m do
 6:   Initialize dist = ∞
 7:   Initialize representative = null
 8:   for rep in representatives do
 9:     if d(x_i, rep) < dist then
10:       representative = rep
11:       dist = d(x_i, rep)
12:     end if
13:   end for
14:   if dist ≤ δ then
15:     add x_i to cluster_representative
16:   else
17:     representative = x_i; add x_i to representatives
18:     Initialize cluster_representative = ∅
19:     add x_i to cluster_representative
20:     add cluster_representative to clusters
21:   end if
22: end for

While this straightforward approach works well in the sense that it does produce 
a legal representative set, it is sensitive to scan order, therefore violating the 
desired stability property. More importantly, it does not address the average 
distance criterion. For these reasons, we extend this algorithm into an iterative 
one, a hybrid of sorts between direct representative selection and EM clustering 
approaches. 
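The one-shot scan and its sensitivity to scan order can be illustrated with a short sketch. The function name and the toy data are hypothetical, and samples are assumed to be hashable and distinct so that they can serve as dictionary keys.

```python
def one_shot_representatives(samples, d, delta):
    """Single scan: map each collected representative to the samples it covers."""
    clusters = {}  # representative -> list of samples assigned to it
    for x in samples:
        # Closest representative collected so far (None on the first sample).
        best = min(clusters, key=lambda r: d(x, r), default=None)
        if best is not None and d(x, best) <= delta:
            clusters[best].append(x)
        else:
            clusters[x] = [x]  # x is sufficiently "new": it becomes a representative
    return clusters


if __name__ == "__main__":
    abs_diff = lambda a, b: abs(a - b)
    # The same five points, scanned in two different orders.
    print(sorted(one_shot_representatives([0, 1, 2, 3, 4], abs_diff, 1)))
    print(sorted(one_shot_representatives([1, 3, 0, 2, 4], abs_diff, 1)))
```

On this toy input the first ordering collects three representatives and the second only two, which shows in miniature why a single scan does not satisfy the stability desideratum.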

3.2 The Full δ-Medoids Algorithm

This algorithm is based on the straightforward approach, as described in
Section 3.1. However, unlike Algorithm 2, it repeatedly iterates through the
samples. In each iteration, the algorithm associates each sample with a
representative so that it is never more than δ away from some representative (the
RepAssign subroutine, see Algorithm 3), just as in Algorithm 2. The main
difference is that at the end of each iteration it subsequently finds a
closer-fitting representative for each cluster S associated with representative s.
Concretely,
$\mathrm{representative}_s = \mathrm{medoid}_S = \arg\min_{s \in S} \sum_{x \in S} d(x, s)$
(lines 8-13), under the constraint that no sample x ∈ S is farther than δ from
$\mathrm{medoid}_S$. This step ensures that a representative is “best-fit” on
average to the cluster of samples it represents, without sacrificing coverage. In
other words, while trying to minimize the size of the representative set, the
algorithm also addresses Criterion 2, keeping the average distance as low as
possible. This step also drastically improves the stability of the retrieved
representative set under different permutations of the data (Desideratum 1). We
note that by adding the constraint that new representatives must still cover the
clusters they were selected from, we guarantee that the number of representatives
k does not increase after the first scan.

Algorithm 3 The δ-medoid representative selection algorithm.
 1: Input: data x_0 ... x_m, required distance δ
 2: t = 0
 3: Initialize representatives_{t=0} = ∅
 4: Initialize clusters = ∅
 5: repeat
 6:   t = t + 1
 7:   call RepAssign subroutine, lines 5-22 of Algorithm 2
 8:   Initialize representatives_t = ∅
 9:   for cluster in clusters do
10:     representative = argmin_{s ∈ cluster} Σ_{x ∈ cluster} d(x, s)  s.t.  ∀x ∈ cluster. d(x, s) ≤ δ
11:     add representative to representatives_t
12:   end for
13: until representatives_t = representatives_{t-1}
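A compact sketch of the iterative procedure is given below. It is one possible reading of Algorithm 3 rather than a reference implementation: samples are assumed hashable and distinct, `d(x, x) = 0` is assumed, clusters are rebuilt from scratch in every iteration, and the `max_iter` safeguard is an addition made for the example.

```python
def delta_medoids(samples, d, delta, max_iter=100):
    """Alternate RepAssign scans with constrained medoid updates."""
    reps = []
    for _ in range(max_iter):
        # RepAssign: scan the data, starting from the current representatives.
        clusters = {r: [] for r in reps}
        for x in samples:
            best = min(clusters, key=lambda r: d(x, r), default=None)
            if best is not None and d(x, best) <= delta:
                clusters[best].append(x)
            else:
                clusters[x] = [x]
        # Refinement: medoid of each cluster among members that still cover it.
        new_reps = []
        for rep, members in clusters.items():
            if not members:
                continue
            valid = [s for s in members
                     if all(d(x, s) <= delta for x in members)] or [rep]
            new_reps.append(min(valid, key=lambda s: sum(d(x, s) for x in members)))
        if set(new_reps) == set(reps):
            break  # representatives unchanged: converged
        reps = new_reps
    return reps


if __name__ == "__main__":
    abs_diff = lambda a, b: abs(a - b)
    print(delta_medoids([0, 1, 2, 10, 11, 12], abs_diff, delta=2))
```

In the toy run the first scan would pick the samples 0 and 10, and the refinement step then moves each representative to the middle of its cluster, lowering the average distance without adding representatives.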

The process is repeated until δ-medoids reaches convergence, or until we reach
a representative set which is “good enough” (remember that at the end of each
cluster-association iteration we have a set that satisfies the distance criterion).
This algorithm uses a greedy heuristic that is guaranteed to converge to
some local optimum (Theorem 1). This local optimum is dependent on the value
of δ and the structure of the data. In Section 4.1 we show that solving the
representative selection problem for a given δ is NP-hard, and therefore heuristics
are required.

Theorem 1 Algorithm 3 converges after a finite number of steps. 

See Appendix A for proof sketch. 


3.3 Merging Close Clusters 

Since satisfying the distance constraint with a minimal set of representatives
(Criterion 1) is NP-hard (see Section 4), the δ-medoids algorithm is not
guaranteed to do so. A simple optimization procedure can reduce the number of
representatives in certain cases, for instance when oversegmentation
ensues. To abate such an occurrence, it is possible to iterate through
representative pairs that are no more than δ apart, and see whether joining their
respective clusters could yield a new representative that covers all the samples
in the joined clusters. If it is possible, the two representatives are eliminated in
favor of the new joint representative. The process is repeated until no pair in the
potential pair list can be merged. This procedure can be generalized for larger
representative group sizes, depending on computational tractability. These
refinement steps can be taken after each iteration of the algorithm. If the number
of representatives is high, however, this approach may be computationally
infeasible altogether. Although this procedure was not required in our problem
domains (see Section 5), we believe it may still prove useful in certain cases.
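The pairwise merging step might be sketched as follows. The data layout (a dictionary from representative to cluster members), the function name, and the choice of the joint representative as the constrained medoid of the merged cluster are assumptions made for illustration.

```python
def merge_close_clusters(clusters, d, delta):
    """Repeatedly merge cluster pairs that a single member can cover within delta."""
    clusters = {r: list(m) for r, m in clusters.items()}
    merged = True
    while merged:
        merged = False
        reps = list(clusters)
        for i, a in enumerate(reps):
            for b in reps[i + 1:]:
                if d(a, b) > delta:
                    continue  # only nearby representative pairs are candidates
                joint = clusters[a] + clusters[b]
                valid = [s for s in joint if all(d(x, s) <= delta for x in joint)]
                if valid:
                    new_rep = min(valid, key=lambda s: sum(d(x, s) for x in joint))
                    del clusters[a], clusters[b]
                    clusters[new_rep] = joint
                    merged = True
                    break
            if merged:
                break  # restart the pair scan on the updated cluster set
    return clusters
```

Each successful merge removes one representative, so the loop performs at most one merge per initial cluster; the cost of the repeated pair scans is what can make the procedure impractical when the representative set is large.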

4 Analysis Summary 

In this section, we present the hardness of the representative selection problem,
and briefly discuss the efficiency of the δ-medoids algorithm. We show that the
problem of finding a minimal representative set is NP-hard, and provide certain
bounds on the performance of δ-medoids in metric spaces with respect to
representative set size and average distance. We continue to show that approximating
the representative selection problem is NP-hard in non-metric spaces, both in
terms of the representative set size and with respect to the maximal distance.
For the sake of readability, we present full details in Appendix D.

4.1 NP-Hardness of the Representative Selection Problem 
Theorem 2 Satisfying Criterion 1 (minimal representative set) is NP-Hard. 

4.2 Bounds on δ-Medoids in Metric Spaces

The δ-medoids algorithm is agnostic to the existence of a metric space in which
the samples can be embedded. That being said, it can work equally well in cases
where the data is metric (in Appendix C we demonstrate the performance of the
δ-medoids algorithm in a metric space test-case). However, we can show that if
the measure which generates the pairwise distances is in fact a metric, certain
bounds on performance exist.

Theorem 3 In a metric space, the average distance of a representative set |C| =
k obtained by the δ-medoids algorithm is bounded by 2·OPT, where OPT is the
maximal distance obtained by an optimal assignment of k representatives (with
respect to maximal distance).

Theorem 4 The size of the representative set returned by the δ-medoids
algorithm, k, is bounded by k ≤ N(δ/2), where N(x) is the minimal number of
representatives required to satisfy distance criterion x.
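One way to see why a bound of the form k ≤ N(δ/2) is plausible is a standard packing argument, sketched below. This is an illustrative reconstruction under the assumptions that d is a (symmetric) metric and that the first scan behaves like Algorithm 2; the full proof is deferred to Appendix D and may proceed differently. The symbols R and C* are introduced here for the sketch only.

```latex
% R   : representatives collected by the first scan
% C^* : a smallest set covering S at radius \delta/2, so |C^*| = N(\delta/2)
\begin{align*}
&\forall\, r_1 \neq r_2 \in R: && d(r_1, r_2) > \delta
  && \text{(a sample joins } R \text{ only if it is more than } \delta \text{ from } R\text{)} \\
&\forall\, r \in R \;\; \exists\, c(r) \in C^*: && d(r, c(r)) \le \delta/2
  && \text{(} C^* \text{ covers every sample, including } r \text{)} \\
&c(r_1) = c(r_2) = c \;\Rightarrow && d(r_1, r_2) \le d(r_1, c) + d(c, r_2) \le \delta
  && \text{(triangle inequality, symmetry)}
\end{align*}
% The last line contradicts the first, so r \mapsto c(r) is injective
% and |R| \le |C^*| = N(\delta/2).
```

Since the number of representatives does not increase after the first scan, the same bound would carry over to the final representative set.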

4.3 Hardness of Approximation of Representative Selection in 
Non-Metric Spaces 

In non-metric spaces, the representative selection problem becomes much harder. 
We now show that no c-approximation exists for the representative selection 
problem either with respect to the first criterion (representative set size) or the 
second criterion (distance - we focus on maximal distance but a similar outcome 
for average distance is implied). 


Theorem 5 No constant-factor approximation exists for the representative
selection problem with respect to representative set size.

Theorem 6 For representative sets of optimal size k,⁵ no constant-factor
approximation exists with respect to the maximal distance between the optimal
representative set and the samples.


4.4 Efficiency of δ-Medoids

The actual run time of the algorithm is largely dependent on the data and the
choice of δ. An important observation is that at each iteration, each sample
is only compared to the current representative set, and a sample is introduced
to the representative set only if it is more than δ away from all other
representatives. After each iteration, the representatives induce a partition into
clusters and only samples within the same cluster are compared to one another.
While in the worst case the runtime complexity of the algorithm can be O(|S|²),
in practice we can get considerably better runtime performance. We note that in
each iteration of the algorithm, after the partitioning phase (the RepAssign
subroutine in Algorithm 2), the algorithm maintains a legal representative set, so
in practice we can halt the algorithm well before convergence, depending on need
and resources.


5 Empirical Results 

In this section, we analyze the performance of the δ-medoids algorithm
empirically in two problem domains: music analysis and agent movement analysis.
We show that δ-medoids does well on Criterion 1 (minimizing the representative
set) while obtaining a good solution for Criterion 2 (maintaining a low average
distance). We compare δ-medoids to three alternative methods: k-medoids,
the greedy k-center heuristic, and spectral clustering (using cluster medoids as
representatives) [33], and show that it outperforms all three. We note that although
these methods were not necessarily designed to tackle the representative selection
problem, they, and clustering approaches in general, are used for such purposes
in practice. To obtain some measure of statistical significance, for each dataset
we analyze, we take a random subset of |S| = 5000
samples and use this subset as input for the representative selection algorithm.
We repeat this process N = 20 times, averaging the results and obtaining standard
errors. We show that the δ-medoids algorithm produces representative sets
at least as compact as those produced by the k-centers approach, but obtains
a much lower average distance. We further note it does so directly, without the
need for first optimizing the number of clusters k, unlike k-medoids or spectral
clustering.
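The subsampling protocol described above could be organized along the following lines. This is a hedged sketch: the function name `evaluate`, the `select` callback and the seed handling are assumptions, and only the representative set size is tracked here, although the same loop could record average distance as well.

```python
import random
import statistics


def evaluate(dataset, select, subset_size=5000, repeats=20, seed=0):
    """Mean and standard error of the representative set size over random subsets.

    `select` is any callable mapping a list of samples to a representative set.
    """
    rng = random.Random(seed)
    sizes = []
    for _ in range(repeats):
        subset = rng.sample(dataset, min(subset_size, len(dataset)))
        sizes.append(len(select(subset)))
    mean = statistics.mean(sizes)
    # Standard error of the mean; defined as 0.0 for a single repeat.
    sem = statistics.stdev(sizes) / len(sizes) ** 0.5 if len(sizes) > 1 else 0.0
    return mean, sem
```

Passing each compared method as `select` with the same seed would evaluate them on identical subsets, which keeps the comparison paired.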

⁵ In fact, this proof applies for any value of k that cannot be directly manipulated by
the algorithm.



In Appendix C, we also demonstrate the performance of the algorithm in a 
standard metric space, and show it outperforms the other methods in this setting 
as well. 

5.1 Distance Measures 

In both our problem domains, no simple or commonly accepted measure of
distance between samples exists. For this reason, we devised a distance function for
each setting, based on domain knowledge and experimentation. Feature selection
and distance measure optimization are beyond the scope of this work. For
completeness, the full details of our distance functions appear in Appendix E.
We believe the results are not particularly sensitive to the choice of a specific
distance function, but we leave such analysis to future work.

5.2 Setting 1 - Musical Segments

In this setting, we wish to summarize a set of musical pieces. This domain
illustrates many of the motivations listed in Section 1. The need for good
representative selection is driven by several tasks, including style characterization,
comparing different musical corpora (see [12]), and music classification by
composer, genre or period [2].

For the purpose of this work we used the Music21 corpus, provided in MusicXML 
format [3]. For simplicity, we focus on the melodic content of the piece, which 
can be characterized as the variation of pitch (or frequency) over time. 


Data We use thirty musical pieces: 10 representative pieces each by Mozart,
Beethoven, and Haydn. The melodic lines in the pieces are isolated and segmented
using basic grouping principles adapted from [29]. In the segmentation process,
short overlapping melodic sequences 5 to 8 beats long are generated. For example,
three such segments are plotted in Figure 2 as pitch variation over time. For
each movement and each instrument, segmentation results in 55 to 518 segments.
All in all we obtain 20,000 to 40,000 segments per composer.


Distance measure We devise a fairly complex distance measure between any
two musical segments S₁ and S₂. Several factors are taken into account:

- Global alignment - the global alignment score between the two segments,
calculated using the Needleman-Wunsch algorithm.

- Local alignment - the local alignment score between the two segments,
calculated using the Smith-Waterman algorithm. Local alignment is useful
if two sequences are different overall, but share a meaningful subsequence.

- Rhythmic overlap, interval overlap, step overlap, pitch overlap - the extent to
which one-step melodic and rhythmic patterns in the two segments overlap,
using a “bag”-like distance function d_set(A₁, A₂).




Fig. 2. Three musical segments as pitch (in MIDI format) over time, along with the
musical notation of the first segment (1stViolinSeg).


The different factors are then weighted and combined. This measure was chosen 
because similarity between sequences is multifaceted, and the different factors 
above capture different aspects of similarity, such as sharing a general contour 
(global alignment), a common motif (local alignment), or a similar “musical 
vocabulary” (the other factors, which by themselves each capture a different 
aspect of musical language). The result is a measure but not a metric since local 
alignment may violate the triangle inequality. 
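
Whether a given pairwise distance matrix is a metric can be checked directly. The following is a minimal sketch, not the procedure used in this work: the function name and the toy distances are hypothetical, and the matrix is assumed to be a square list of lists.

```python
from itertools import product


def find_triangle_violation(dist, tol=1e-12):
    """Return the first (i, j, k) with dist[i][k] > dist[i][j] + dist[j][k], else None."""
    n = len(dist)
    for i, j, k in product(range(n), repeat=3):
        if dist[i][k] > dist[i][j] + dist[j][k] + tol:
            return (i, j, k)
    return None


# Toy, made-up distances: items 0 and 2 are each close to item 1
# (e.g. they share a motif with it) but are far from each other.
toy = [
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 1.0],
    [5.0, 1.0, 0.0],
]
violation = find_triangle_violation(toy)
```

A non-`None` result is a witness that the measure is not a metric, which is the situation that local alignment can create.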


Results We compare δ-medoids to the k-medoids algorithm and the greedy
k-center heuristic for five different δ values. The results are presented in Figure
3. For each composer and δ, we searched exhaustively for the lowest k value for
which k-medoids met the distance requirement. We study both the size of the
representative set obtained and the average sample-representative distance.

From the representative set size perspective, for all choices of δ the δ-medoids
algorithm covers the data with a smaller representative set than k-medoids, and
does at least as well (and most often better) compared to the greedy k-centers
heuristic. However, in terms of average distance, δ-medoids performs much better
than the k-centers heuristic, implying that the δ-medoids algorithm outperforms
the other two. While spectral clustering seems to satisfy the distance
criteria with a small representative set for small values of δ, its non-centroid
based nature makes it less suitable for representative selection, as a more lax δ
criterion might not necessarily mean a smaller representative set will be needed
(as apparent from the results). Indeed, as the value of δ increases, the δ-medoids
algorithm significantly outperforms spectral clustering.



















Fig. 3. Representative set size in % from entire set and average representative set
distance for three different composers, ten different pieces each, and five different
distance criteria. Each column represents data for a different composer. δ-medoids
yields the most compact representative set overall while still obtaining a smaller
average distance than the k-centers heuristic.


5.3 Setting 2 - agent movement in robot soccer simulation 

As described in Section 1.2, analyzing agent behavior can be of interest in
several domains. The robot world-cup soccer competition (RoboCup) is a
well-established problem domain for AI in general [22]. In this work we chose to
focus on the RoboCup 2D Simulation League. We have collected game data
from several full games from the past two RoboCup competitions. An example
of the gameplay setting and potential movement trajectories can be seen in
Figure 4.

Our purpose is to extract segments that best represent agent movement patterns
throughout gameplay. In the specific context of the RoboCup simulation league,
there are several tasks that motivate representative selection, including agent
and team characterization, and learning training trajectories for optimization.


Data Using simulation log data, we extract the movement of the 22 agents over
the course of the game (#timesteps = 6000). The agents move in 2-dimensional
space (three example trajectories can be seen in Figure 4). We extract 1-second
(10 timestamps) long, partially overlapping segments from the full game
trajectories of all the agents on the field except the goalkeepers, who tend to
move less and in a more confined space and for the purpose of this task are of
lesser interest. That leads to 900 · 20 = 18,000 movement segments in total per
game. We analyzed 4 teams and 5 games (90,000 segments) per team.






















Fig. 4. The RoboCup 2D Simulation. Three potential movement trajectories for a
specific agent are marked.


Distance measure Given two trajectories, one can compare them as contours 
in 2-dimensional space. We take an alignment-based approach, with edit costs 
being the RMS distance between them. Our distance measure is comprised of 
three elements: global and local alignment (same as in music analysis), and a 
“bag of words”-style distance based on patterns of movement-and-turn sequences 
(turning is quantized into 6 angle bins). As in music analysis, the reason for this 
approach is that similarity in motion is hard to define, and we believe each 
feature captures different aspects of similarity. As in the previous setting, this is 
not a metric, as local alignment may violate the triangle inequality. 


Results As in the previous setting, we compare δ-medoids to the k-medoids
algorithm as well as the greedy k-center heuristic, for five different game logs
and five different δ values. The results are presented in Figure 5. As before, for
each δ, we searched exhaustively for the optimal choice of k in k-medoids.

The results reinforce the conclusions reached in the previous domain: for all
choices of δ we meet the distance requirement using a much smaller representative
set compared to the k-medoids and spectral clustering approaches (the latter
does much worse in this domain compared to the previous one). Furthermore,
δ-medoids once again does at least as well as the k-centers heuristic. In terms of
average distance, our algorithm performs much better compared to the k-centers
heuristic, suggesting that the δ-medoids algorithm generally outperforms the
other approaches.

Fig. 5. Representative set size in % from entire set for four different teams, five different
game logs each, and five distance criteria. Each column represents game data for a
different team. Axes denoting distance are in log-scale.


5.4 Stability of the δ-Medoids Algorithm

In this section we establish that the δ-medoids algorithm is indeed robust with
respect to scan order (satisfying desideratum 1 from Section 1). To test stability,
we ran δ-medoids, k-medoids, the k-center heuristic and spectral clustering
multiple times on a large collection of datasets, reshuffling the input order on each
iteration, and examined how well preserved the representative set was across
iterations and methods. Our analysis indicated that the first three algorithms
consistently obtain > 90% average overlap, and the level of stability observed is
almost identical. Spectral clustering yields drastically less stable representative
sets. For a fuller description of these results see Appendix B.


6 Summary and Discussion 

In this paper, we present a novel heuristic algorithm to solve the representative
selection problem: finding the smallest possible representative subset that best
fits the data under the constraint that no sample in the data is more than a
predetermined parameter δ away from some representative. We introduce the
novel δ-medoids algorithm and show it outperforms other approaches that are
only concerned with either best fitting the data into a given number of clusters,
or minimizing the maximal distance.

There is a subtle yet significant impact to focusing on a maximal distance
criterion δ rather than choosing the number of clusters k. While both δ-medoids and
k-medoids aim to minimize the sum of distances between representatives and
the full set, k-medoids does so with no regard to any individual distance.
Because of this, we need to increase the value of k drastically in order to guarantee
that our distance criterion is met, and that sparse regions of our sample set are
sufficiently represented. This results in over-representation of dense regions in
our sample set. By carefully balancing between minimality under the distance
constraint and average distance minimization, the δ-medoids algorithm adjusts
the representation density adaptively based on the sample set, without any prior
assumptions.

Although this paper establishes δ-medoids as a leading algorithm for
representative selection, we believe that more sophisticated algorithms can be developed
to handle different variations of this problem, putting different emphasis on the
minimality requirement for the representative set vs. how well the set fits the
data. Depending on the specific nature of the task the representatives are needed
for, different tradeoffs may be most appropriate and lead to algorithmic
variations. For instance, an extension of interest could be to modify the value of
δ adaptively depending on the density of sample neighborhoods. However, we
show that δ-medoids is a promising approach to the general problem of efficient
representative selection.


A Proof of Convergence for the δ-medoids Algorithm


In this section, we give the full proof that the δ-medoids algorithm converges
in finite time.

Theorem 1 Algorithm 3 converges after a finite number of steps. 

Proof Sketch: For any sample s, let us denote its associated cluster
representative at iteration i by C_i(s), and the distance between the sample and its
associated cluster representative by d(s, C_i(s)). Consider the overall sum of
distances from each point to its associated cluster representative, Σ_s d(s, C_i(s)).

Assume that after the i-th round, we obtain a partition into k clusters, C_1, ..., C_k.
Our next step is to go over each cluster C_j and reassign a representative sample
that minimizes the sum of distances from each point to the representative of
that cluster,

$$\operatorname*{arg\,min}_{s \in C_j} \sum_{x \in C_j} d(x, s),$$

under the constraint that all samples within the cluster are still within δ
distance of the representative. Since this constraint holds prior to the minimization
phase, the new representative must either preserve or reduce the sum of distances
within the cluster.

We do this independently for each cluster. If the representatives are unchanged,
we have reached convergence and the algorithm stops. Otherwise, the overall
sum of distances is diminished. At the (i+1)-th round, we reassign clusters for
the samples. A sample can either remain within the same cluster or move to a
different cluster. If a sample remains in the same cluster, its distance from its
associated representative is unchanged. On the other hand, if it moves to a different
cluster, it means that d(s, C_i(s)) > d(s, C_{i+1}(s)), necessarily, so the overall sum
of distances from associated cluster representatives is reduced. Therefore, after
each iteration we either reach convergence or the sum of distances is reduced.
Since there is a finite number of samples, there is a finite number of distance
sums, which implies that the algorithm must converge after a finite number of
iterations. □
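
The representative reassignment step used in the proof can be sketched as follows. This is an illustrative sketch only, not the paper's Algorithm 3: the function name, the argument layout (`dist[x][s]` holding d(x, s), clusters given as lists of sample indices) and the tie-breaking are assumptions.

```python
def update_representative(cluster, current, dist, delta):
    """Pick the cluster member minimizing the summed distance to the cluster,
    among members that keep every sample in the cluster within delta.
    Falls back to the current representative if no candidate improves on it."""
    best = current
    best_cost = sum(dist[x][current] for x in cluster)
    for s in cluster:
        # Constraint from the proof: all samples stay within delta of s.
        if any(dist[x][s] > delta for x in cluster):
            continue
        cost = sum(dist[x][s] for x in cluster)
        if cost < best_cost:  # strict improvement, so the sum never increases
            best, best_cost = s, cost
    return best


# Toy example: three samples on a line at positions 0, 1 and 2.
positions = [0, 1, 2]
dist = [[abs(a - b) for b in positions] for a in positions]
new_rep = update_representative([0, 1, 2], current=0, dist=dist, delta=2)
```

Because a candidate replaces the current representative only on a strict decrease of the summed distance, the per-cluster sum is either preserved or reduced, which is the property the convergence argument relies on.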


B Stability of δ-Medoids

In this section we establish that the δ-medoids algorithm is indeed robust with
respect to scan order (satisfying desideratum 1 from Section 1). To test this
issue, we generated a large collection (N = 1000) of datasets (randomly sampled
from randomly generated multimodal distributions). For each dataset in the
collection, we ran the algorithm #repetitions = 100 times, each time
reshuffling the input order. Next, we calculated the average overlap between any two
representative sets generated by this procedure for the same dataset. We then
calculated a histogram of average overlap score over all the data inputs.
Finally, we compared these stability results to those obtained by the k-medoids
algorithm (which randomizes starting positions), the k-centers heuristic (which
randomizes the starting node), and spectral clustering (which uses k-means to
partition the eigenvectors of the normalized Laplacian). Our analysis indicated
that for the first three algorithms, in more than 90% of the generated datasets,
there was a > 90% average overlap. The overlaps observed are almost exactly
the same, implying the expected extent of overlap is dependent more on the
structure of the data than on the type of randomization the algorithm employs.
This serves as evidence that δ-medoids is sufficiently stable, as desired. It should
be noted that spectral clustering yields drastically less stable results (ranging
between 15% and 40% overlap), implying a heightened level of stochasticity in
the partitioning phase. A histogram indicating our results can be found in
Figure 6. One can see that our algorithm has virtually identical stability compared
to both k-medoids and the greedy k-center approaches (which, as stated before,
are not sensitive to scan order but contain other types of randomization).


Fig. 6. The histograms (plotted as density functions, i.e. counts normalized as
percentages) of the average overlap between representative sets found for each method for
the same data under different permutations (overlap measured in %). For k-medoids,
δ-medoids and the k-centers heuristic, in more than 90% of the datasets, there was a
> 90% average overlap. Spectral clustering yields drastically less consistent
representation sets. The overlaps observed are almost exactly the same, implying the expected
extent of overlap is dependent more on the structure of the data than on the type of
randomization the algorithm employs.
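
The average overlap statistic used in this appendix could be computed along the following lines. The exact overlap definition is not spelled out in the text, so this sketch assumes intersection-over-union between two representative sets; the function names are hypothetical.

```python
from itertools import combinations


def overlap(a, b):
    """Assumed overlap measure: |A ∩ B| / |A ∪ B| (1.0 for two empty sets)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def average_pairwise_overlap(rep_sets):
    """Mean overlap over all unordered pairs of representative sets
    obtained from different shuffles of the same dataset."""
    pairs = list(combinations(rep_sets, 2))
    return sum(overlap(a, b) for a, b in pairs) / len(pairs)


# Three made-up runs on the same dataset (sample indices of the representatives).
runs = [{1, 2, 3}, {1, 2, 3}, {1, 2, 4}]
score = average_pairwise_overlap(runs)
```

In the experiment described above, one such score would be computed per dataset from its 100 reshuffled runs, and the histogram is taken over the per-dataset scores.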


C Performance of δ-Medoids in Metric Spaces

As we state in the paper, though the δ-medoids algorithm is designed to handle
non-metric settings, it can easily be used in metric cases as well. In this section we
compare the performance of the algorithm to the benchmark methods used in the
Empirical Results section. To generate a standard metric setting, we consider
a 10-dimensional metric space where samples are drawn from a multivariate
Gaussian distribution. We draw 1000 samples per experiment, 20 experiments
per setting, with randomly chosen means and variances. The results are presented
in Figure 7.

Fig. 7. Representative set size in % from entire set and average representative set
distance for four different multivariate Gaussian distributions from which the samples
are drawn, 20 different experiments each, and four different distribution values. Each
column represents data for a different distribution. δ-medoids yields the most compact
representative set overall while still obtaining a smaller average distance than the
k-centers heuristic.


As one can observe, the performance of the δ-medoids algorithm relative to the
other methods is qualitatively the same as in the non-metric cases, despite the
metric property of the data in this setting.


D Extended Analysis 

In this section, we consider the hardness of the representative selection problem,
and discuss the efficiency of the δ-medoids algorithm.

D.1 NP-Hardness of the Representative Selection Problem

Theorem 2 Satisfying Criterion 1 (minimal representative set) is NP-Hard.






















Proof Sketch: We show this via a reduction from the vertex cover problem.
Given a graph G = (V, E) we construct a distance matrix M of size |V| × |V|. If
two different vertices in the graph, v_i, v_j, i ≠ j, are connected, we set the value
of entries (i, j) and (j, i) in M to be δ − 1. Otherwise, we set the value of the
entry to δ + 1. Formally:

$$M(i,j) = \begin{cases} \delta - 1 & (i,j) \in E \\ 0 & i = j \\ \delta + 1 & \text{otherwise} \end{cases}$$

This construction is polynomial in |V| and |E|. Let us assume we know how to
obtain the optimal representative set for δ in this case. Then the representative
set S_rep can be easily translated back to a vertex cover for graph G: simply
choose all the vertices that correspond to members of the representative set.
Every sample i in the sample set induced by M has to be within δ range of
some representative j, meaning that there is an equivalent edge (i, j) ∈ E,
which j covers. Since the representative set is minimal, the vertex set is also
guaranteed to be the minimal. Therefore, if we could solve the representative 
selection problem efficiently, we could also solve the vertex cover problem. Since 
vertex cover is known to be NP-hard [Hj , then so is representative selection. □ 
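
The construction of M in the proof sketch can be written down directly. The sketch below is illustrative: the function names and the edge-list input format are assumptions, and it only builds the matrix and checks the δ-coverage condition, it does not solve either problem.

```python
def build_distance_matrix(num_vertices, edges, delta):
    """M(i, j) = 0 if i == j, delta - 1 if (i, j) is an edge, delta + 1 otherwise."""
    m = [[delta + 1] * num_vertices for _ in range(num_vertices)]
    for i in range(num_vertices):
        m[i][i] = 0
    for i, j in edges:  # edges are assumed to join two different vertices
        m[i][j] = m[j][i] = delta - 1
    return m


def is_representative_set(m, reps, delta):
    """True if every sample is within delta of at least one representative."""
    return all(any(m[i][j] <= delta for j in reps) for i in range(len(m)))


# Path graph 0 - 1 - 2 with delta = 5.
path = build_distance_matrix(3, [(0, 1), (1, 2)], delta=5)
```

On this toy graph the middle vertex alone is within δ of every sample, whereas an end vertex alone is not.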

D.2 Bounds on δ-medoids in Metric Spaces

The δ-medoids algorithm is agnostic to the existence of a metric space in which the
samples can be embedded. However, we can show that given that the distance
measure which generates the pairwise distances is in fact a metric, certain bounds
on performance ensue.

Theorem 3 In a metric space, the average distance of a representative set |G| =
k obtained by the δ-medoids algorithm is bounded by 2·OPT, where OPT is the
maximal distance obtained by an optimal assignment of k representatives (with
respect to maximal distance).

To prove this theorem, and the following one, we will first prove the following
helper lemma:

Lemma 1 In a metric space, the maximal distance of a representative set |G| =
k obtained by the one-shot δ-representative algorithm (Algorithm 2) is bounded by
2·OPT, where OPT is the maximal distance obtained by an optimal assignment
of k representatives (with respect to maximal distance).

Proof Sketch: Let K, |K| = k, be the representative set returned by Algorithm
2. Let a* be the sample which is the farthest from the representative set, and
let its distance to the closest representative be δ*. Consider the set K ∪ {a*}.
All k + 1 points in this set must be at distance ≥ δ* from one another: the
algorithm would not select representatives at distance < δ from one another, and
δ ≥ δ*, whereas a* is defined as being at least δ* away from any point in K. Let
us consider the optimal assignment of k representatives, K*, and let OPT be the
maximal distance it achieves. By the pigeonhole principle, at least two samples in
the set K ∪ {a*} must be associated with the same representative. W.l.o.g., let us
call these samples x_1 and x_2, and k* ∈ K* their associated representative. Since
the distance between x_1 and x_2 is at least δ*, and since this is a metric space,
by the triangle inequality, the distance of k* from at least one of them cannot be
smaller than δ*/2. Therefore δ* ≤ 2·OPT. □

This implies that Algorithm 2 is asymptotically equivalent to the k-centers
farthest-first traversal heuristic with respect to maximal distance.
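
For reference, the farthest-first traversal heuristic for k-centers mentioned here can be sketched as below. This is the standard textbook heuristic rather than code from this work; the function name is hypothetical, the start node is fixed for reproducibility (the comparison in the paper randomizes it), and k is assumed not to exceed the number of samples.

```python
def farthest_first(dist, k, start=0):
    """Greedy k-centers: repeatedly add the sample farthest from the chosen centers.
    Returns the centers and the resulting maximal sample-to-center distance."""
    n = len(dist)
    centers = [start]
    nearest = list(dist[start])  # distance from each sample to its nearest center
    while len(centers) < k:
        nxt = max(range(n), key=lambda i: nearest[i])
        centers.append(nxt)
        nearest = [min(nearest[i], dist[nxt][i]) for i in range(n)]
    return centers, max(nearest)


# Toy example: five samples on a line.
positions = [0, 1, 5, 6, 20]
dist = [[abs(a - b) for b in positions] for a in positions]
centers, radius = farthest_first(dist, k=3)
```

Here the traversal picks the samples at positions 0, 20 and 6, leaving every sample within distance 1 of a center.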

Now we can prove Theorem 3.

Proof Sketch:

First, let us consider Algorithm 2 (one-shot δ-representatives), on which the
δ-medoids algorithm is based. By Lemma 1, the maximal distance obtained by
it for a representative set of size k is ≤ 2·OPT, where OPT is the maximal
distance obtained by an optimal solution of size k (with respect to maximal
distance). The average distance obtained by Algorithm 2 cannot be greater than
the maximal distance, so the same bound holds for the average distance as well.
Now let us consider the full δ-medoids algorithm: by definition, it can only reduce
the average distance (while maintaining the same representative set size). So the
average distance obtained by the δ-medoids algorithm must be bounded by 2·OPT
as well. □

Theorem 4 The size of the representative set returned by the δ-medoids
algorithm, k, is bounded by k ≤ N(δ/2), where N(x) is the minimal number of
representatives required to satisfy distance criterion x.

Proof Sketch: By Lemma 1, the maximal distance obtained by Algorithm 2 for a
representative set of size k is ≤ 2·OPT, where OPT is the maximal distance obtained
by an optimal solution of size k (with respect to maximal distance). Let N(δ)
be the covering number for the sample set and distance criterion δ, that is,
the smallest number of representatives required so that no sample is farther than
δ from a representative. The size of the representative set returned by Algorithm
2 is therefore bounded by k ≤ N(δ/2). Since the full δ-medoids algorithm
(Algorithm 3) first runs Algorithm 2 and is guaranteed never to increase the size
of the representative set, the same bound holds for it as well. □

In ℝ^d, the covering number N(ε) is bounded by O(1/ε^d). Given that N(δ) ≥ |K*|,
where K* is the optimal selection of representatives, this implies the solution
returned by δ-medoids is bounded by a factor of O(2^d).

This is equivalent to the similar bound known for the k-center heuristic [18,19].

D.3 Hardness of Approximation of Representative Selection in 
Non-Metric Spaces 

In non-metric spaces, the representative selection problem becomes much harder.
We now show that no c-approximation exists for the representative selection
problem either with respect to the first criterion (representative set size) or the
second criterion (distance; we focus on maximal distance, but a similar outcome
for average distance is implied).

Theorem 5 No constant-factor approximation exists for the representative
selection problem with respect to representative set size.

Proof Sketch: We show this via a reduction from the set cover problem. Given
a collection of n sets over |S| = s elements, we construct a graph G = (V, E)
containing |V| = s + n nodes: one node for each subset, and one node for each
element. The graph is fully connected (|E| = |V| × |V|). Let N, |N| = n, and M,
|M| = s, be the sets of nodes for subsets and elements, respectively. We define the
distance matrix between elements in the graph (i.e. weights on the edges) to be as
follows:

$$d(i,j) = \begin{cases} \delta - 1 & i \in N,\ j \in M,\ \text{and subset } i \text{ contains element } j \\ 0 & i \in N \text{ and } j \in N \\ \delta + 1 & \text{otherwise (in particular, whenever } i \in M\text{)} \end{cases}$$

In other words, each node representing a subset is connected to itself and the
other subset nodes with an edge of weight 0, and to the respective node of each
element it comprises with an edge of weight δ − 1. Element nodes are connected
to all nodes with edges of weight δ + 1. This construction takes polynomial time.
Note that the distance of any element in N (representing subsets) to itself is 0,
and the distance of every element in M to itself is δ + 1. Let us assume we have a
c-approximating algorithm for the representative selection problem with respect to
representative set size. Any solution obtained by this algorithm with parameter δ
would also yield a c-approximation for the set cover problem. Let us observe any
result of such an algorithm: it would not return any nodes representing elements
(because they are > δ distant from any node in the graph, including themselves).
The distance between any nodes representing subsets in the graph is 0, so a single
subset node is sufficient to cover all of N. Therefore, the representative
set will only comprise elements from N, which directly cover elements in M. An
optimal solution for the representative selection problem will also serve as an
optimal solution for the original set cover problem, and vice versa (otherwise a
contradiction ensues). Therefore, a c-approximation (with respect to set size) for
the representative selection problem would also mean a c-approximation for the
set cover problem. However, it is known that no approximation better than c log n
is possible [30]. Therefore, a c-approximating algorithm for the representative
selection problem (with respect to set size) cannot be obtained unless P = NP.

□ 

Theorem 6 For representative sets of optimal size (*), no constant-factor
approximation exists with respect to the maximal distance between the optimal
representative set and the samples.

(*) In fact, this proof applies for any value of k that cannot be directly manipulated by
the algorithm.



Proof Sketch: We show this via a reduction from the dominating set problem.
Given a graph G = (V, E), a dominating set is defined as a subset V* ⊆ V so
that every node v ∈ V that is not in V* is adjacent to at least one member of
V*. Finding the minimal dominating set is known to be NP-complete m- 
Assume we are given a graph G and are required to find a minimal dominating
set. Let us generate a new graph G' = (V, E'), where V are the original nodes
of G and the graph is fully connected: |E'| = |V| × |V|. The weights on the edges
are defined as follows:

$$w(i,j) = \begin{cases} \delta - 1 & (i,j) \in E \text{ (original graph)} \\ 0 & i = j \\ 2 \cdot c \cdot (\delta - 1) & \text{otherwise} \end{cases}$$

This reduction is polynomial. Let us consider an optimal representative set with
parameter δ for G'. Assume it is of size k. This would imply there is a dominating
set of size k, which is the minimal dominating set obtainable in graph G. This
dominating set is minimal, otherwise the representative set would not
be optimal. Let us assume we have an algorithm for representative selection
that is c-approximating with respect to maximal distance. If there is a dominating
set of size k, it would imply a guarantee of c · (δ − 1) on the maximal distance,
implying the algorithm would behave the same as an optimal algorithm (since it
cannot use edges of weight 2 · c · (δ − 1)). For this reason, a c-maximum-distance
approximation algorithm for the representative selection problem could be used
to solve the dominating set problem. Since this problem is NP-hard, it implies
no such approximation algorithm exists unless P = NP.

□ 


D.4 Efficiency of δ-Medoids

The actual run time of the algorithm is largely dependent on the data and the
choice of δ. An important observation is that at each iteration, each sample is
only compared to the current representative set, and a sample is introduced to
the representative set only if it is > δ away from all other representatives. After
each iteration, the representatives induce a partition into clusters and only samples
within the same cluster are compared to one another. A poor choice of δ, for
instance δ < min{d(x_i, x_j) | x_i, x_j ∈ S}, would cause all the samples to be added
to the representative set, resulting in a runtime complexity of O(|S|²). In practice,
however, since we only compare samples to representatives and within clusters,
for reasonable cases we can get considerably better runtime performance.
For instance, if the number of representatives is close to √|S|, the complexity
would be reduced to O(|S|^1.5), which results in a significant speed-up. Again,
note that in each iteration of the algorithm, after the partitioning phase (the
RepAssign subroutine in Algorithms 2 and 3) the algorithm maintains a legal
representative set, so in practice we can halt the algorithm before convergence,
depending on need and resources.
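
The scan described above, in which a sample joins the representative set only if it is more than δ away from every current representative, can be sketched as follows. This is a simplified illustration, not the paper's Algorithm 2 verbatim: the function name is hypothetical, and `dist[r][s]` is assumed to hold the distance between representative r and sample s.

```python
def one_shot_representatives(dist, delta):
    """Single pass over the samples in scan order.
    Each sample is compared only to the current representatives."""
    reps = []
    for s in range(len(dist)):
        if all(dist[r][s] > delta for r in reps):
            reps.append(s)
    # Partition step: attach every sample to its closest representative.
    assignment = [min(reps, key=lambda r: dist[r][s]) for s in range(len(dist))]
    return reps, assignment


# Toy example: four samples on a line at positions 0, 1, 5 and 6.
positions = [0, 1, 5, 6]
dist = [[abs(a - b) for b in positions] for a in positions]
reps, assignment = one_shot_representatives(dist, delta=2)
all_reps, _ = one_shot_representatives(dist, delta=0.5)
```

With δ = 2 only two representatives are needed, while δ = 0.5 is smaller than every pairwise distance and makes every sample a representative, which is the quadratic worst case discussed above.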


E Calculating the Distance Measures 

In this section we describe in some detail how the distance measures we used 
were computed, as well as some of the considerations that were involved in their 
formulation. 

E.1 Musical Segments Distance

Segment Information Every segment is transposed to C. Then, the following 
information is extracted from each segment: 

— Pitch Sequence - the sequential representation of pitch frequency over time.

— Pitch Bag - a “bag” containing all the pitches in the sequence, with sensitivity
to registration.

— Pitch Class Bag - a “bag” containing all the pitches in the sequence, without
sensitivity to registration.

— Rhythm Bag - a “bag” containing all rhythmic patterns in the sequence. 
A rhythmic pattern is defined, for simplicity, as pairs of subsequent note 
durations in the sequence. 

— Interval Bag - a “bag” containing all pitch intervals in the sequence. 

— Step Bag - a “bag” containing all one-step pitch differences in the sequence, 
this is similar to intervals, only it is sensitive to direction. 
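
The bags listed above could be extracted roughly as follows. This is a sketch under stated assumptions: pitches are taken to be MIDI note numbers, pitch classes are computed modulo 12, the transposition to C is omitted, and the function and key names are hypothetical.

```python
from collections import Counter


def extract_bags(pitches, durations):
    """Build the per-segment bags from parallel lists of pitches and note durations."""
    steps = [b - a for a, b in zip(pitches, pitches[1:])]
    return {
        "pitch": Counter(pitches),                         # sensitive to registration
        "pitch_class": Counter(p % 12 for p in pitches),   # invariant to registration
        "rhythm": Counter(zip(durations, durations[1:])),  # pairs of subsequent durations
        "interval": Counter(abs(s) for s in steps),        # interval size only
        "step": Counter(steps),                            # interval with direction
    }


# Made-up four-note segment: C4, D4, E4, D4.
bags = extract_bags([60, 62, 64, 62], [1.0, 0.5, 0.5, 1.0])
```

Using multisets (`Counter`) rather than plain sets is itself an assumption; it keeps track of how often each pattern occurs in a segment.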

Segment Distance We devise a fairly complex distance measure between any 
two musical segments, segl and seg2. Several factors are taken into account: 

— Global alignment - the global alignment score between the two segments. 
This is calculated using the Needleman-Wunsch [?7] algorithm. 

— Local alignment - the local alignment score between the two segments. This 
is calculated using the Smith-Waterman [5S] algorithm. 

— rhythmic overlap - the extent to which one-step rhythmic patterns in the 
two segments overlap. 

— interval overlap - the extent to which one-step interval patterns in the two 
segments overlap. 

— step overlap - the extent to which melodic steps in the two segments overlap. 

— pitch overlap - the extent to which the pitch sets in the two segments overlap. 
This measure is sensitive to registration. 

— pitch class overlap - the extent to which the pitch sets in the two segments 
overlap. This measure is invariant to registration. 

The two alignment measures are combined into a single alignment score. The
other measures are also combined into a separate score, which we name the bag
distance. The two scores are combined using the ℓ2 norm as follows:

$$\mathit{score}_{alignment} = \mathit{alignment}_{global}^2 + 2 \cdot \mathit{alignment}_{local}^2$$

$$\mathit{score}_{bag} = \mathit{score}_{rhythm}^2 + \mathit{score}_{interval}^2 + \mathit{score}_{step}^2 + \mathit{score}_{pitch}^2 + \mathit{score}_{pitchclass}^2$$

$$\mathit{distance} = \sqrt{10 \cdot \mathit{score}_{bag} + \mathit{score}_{alignment}}$$



Substitution Function For both the local alignment and the global alignment 
we used a simple exponentially attenuating function based on frequency distance 
to characterize the likelihood for swaps between any two notes. The function is 
defined as follows: 


cost{A, B) 


1 , 

1 , 


1.3 


4 


\A-B\ = 3rd 
\A-B\ = 5th 
otherwise 


The price of introducing gaps was fixed at 1.5. 


Bag Distance To get the bag distance score between two bags we use the
calculation |Bag_1 Δ Bag_2| / |Bag_1 ∪ Bag_2|.
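
A minimal sketch of the bag distance, read as the size of the symmetric difference divided by the size of the union. Treating the bags as multisets (counts matter) is an assumption of this sketch, as is the function name.

```python
from collections import Counter


def bag_distance(bag1, bag2):
    """|Bag1 Δ Bag2| / |Bag1 ∪ Bag2| on multisets; 0.0 for two empty bags."""
    keys = set(bag1) | set(bag2)
    sym_diff = sum(abs(bag1[k] - bag2[k]) for k in keys)  # Counter returns 0 for missing keys
    union = sum(max(bag1[k], bag2[k]) for k in keys)
    return sym_diff / union if union else 0.0


d = bag_distance(Counter("aab"), Counter("abc"))
```

Identical bags give 0 and disjoint bags give 1, so each overlap factor stays in the range [0, 1] under this reading.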


Example Two example segments are given in Figure 8 in musical notation and 
in Figure 9 as midi pitch over time. 



Fig. 8. Two segments for example, in musical notation. 


The local alignment distance for these two segments is 0.032. The global
alignment distance for these two segments is 1.5. The bag distance score for these
two segments is 2.03. The overall combined distance for these two segments after
weighting is 20.4.


E.2 Movement Segments Distance 

Fig. 9. Same two segments, plotted as MIDI pitch over time.


Segment Information Each segment is comprised of one agent’s (x, y)
coordinates for 10 consecutive timestamps. Then, each segment is translated to
start from coordinates (0, 0), and rotated so that for all segments all players
are facing the same goal. In addition to maintaining the coordinate sequence,
from each such segment we extract a bag of movement-turn pairs, where the
movement represents distance and turns are quantized into 6 angle bins:
forward (−30 to +30 degrees), upper right (+30 to +90 degrees), lower right
(+90 to +150 degrees), backwards (beyond ±150 degrees), lower left (−90 to
−150 degrees), and upper left (−30 to −90 degrees). For instance, the coordinate
sequence (0, 0), (0, 10), (5, 10), (8, 14) induces two movement-turn elements:
10+upper-right-turn, 5+upper-left-turn.
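
The extraction of movement-turn pairs can be sketched as below. Several details are assumptions made for illustration: clockwise turns are taken as positive, bin boundaries are assigned to the bin closer to "forward", each turn is paired with the length of the move preceding it, and lengths are rounded to a 5-unit resolution. The function names are hypothetical.

```python
import math
from collections import Counter


def turn_bin(turn_deg):
    """Map a signed turn angle in degrees (clockwise positive) to one of 6 bins."""
    t = round(turn_deg, 6)  # guard against floating-point noise at bin boundaries
    if -30 <= t <= 30:
        return "forward"
    if 30 < t <= 90:
        return "upper right"
    if 90 < t <= 150:
        return "lower right"
    if -90 <= t < -30:
        return "upper left"
    if -150 <= t < -90:
        return "lower left"
    return "backwards"


def movement_turn_bag(points, resolution=5):
    """Bag of (quantized move length, turn bin) pairs for a coordinate sequence."""
    moves = [(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in zip(points, points[1:])]
    bag = Counter()
    for (dx1, dy1), (dx2, dy2) in zip(moves, moves[1:]):
        length = resolution * round(math.hypot(dx1, dy1) / resolution)
        heading1 = math.degrees(math.atan2(dy1, dx1))
        heading2 = math.degrees(math.atan2(dy2, dx2))
        turn = (heading1 - heading2 + 180) % 360 - 180  # wrapped to [-180, 180)
        bag[(length, turn_bin(turn))] += 1
    return bag


example = movement_turn_bag([(0, 0), (0, 10), (5, 10), (8, 14)])
```

Under these assumptions the coordinate sequence from the text yields one element of length 10 with an upper-right turn and one of length 5 with an upper-left turn.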


Segment Distance Given two trajectories, one can compare them as contours 
in 2-dimensional space. We take an alignment-based approach, with edit step 
costs being the RMS distance between them. Our distance measure is comprised 
of three elements: 

— Global Alignment - The global alignment distance between the two
trajectories once initially aligned together (that is, originating from (0, 0)
coordinates), calculated by the Needleman-Wunsch algorithm.

— Local Alignment - The local alignment distance between the two trajectories,
calculated by the Smith-Waterman algorithm.

— Movement-Turn bag of words distance - we compare the bag distance of
movement-turn elements. We quantize distances into a resolution of 5 meters
to account for variation.











— Overall Δ-distance and Δ-angle distance - We also consider the overall
similarity of the segments in terms of total distance travelled (and the direction
of the movement).

The scores are combined as follows:

$$\mathit{score}_{align} = \mathit{alignment}_{global}^2 + 2.5 \cdot \mathit{alignment}_{local}^2$$

$$\mathit{score}_{overall\Delta} = (\Delta\text{-}\mathit{distance})^2 + (10 \cdot \Delta\text{-}\mathit{angle})^2$$

$$\mathit{distance} = \sqrt{100 \cdot \mathit{score}_{bag} + \mathit{score}_{align} + \mathit{score}_{overall\Delta}}$$


Substitution Function To get the substitution cost for the two alignment
algorithms, we simply use the RMS distance between the two coordinates we
are comparing. Given two points P_1 = (x_1, y_1) and P_2 = (x_2, y_2) the distance is
simply D(P_1, P_2) = √((x_1 − x_2)² + (y_1 − y_2)²). Gaps were greatly penalized with
a penalty of 100 because gaps create discontinuous (and therefore physically
impossible) sequences.
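
A global alignment distance with this substitution cost and gap penalty can be sketched as a cost-minimizing dynamic program in the style of Needleman-Wunsch. This is an illustrative variant, not the exact implementation used here; the function name is hypothetical and trajectories are assumed to be lists of (x, y) tuples.

```python
import math


def global_alignment_distance(traj_a, traj_b, gap=100.0):
    """Minimal total cost of aligning two trajectories end to end.
    Substitution cost is the Euclidean distance between the aligned points;
    every gap costs `gap`."""
    n, m = len(traj_a), len(traj_b)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            (ax, ay), (bx, by) = traj_a[i - 1], traj_b[j - 1]
            sub = math.hypot(ax - bx, ay - by)
            dp[i][j] = min(
                dp[i - 1][j - 1] + sub,  # align the two points
                dp[i - 1][j] + gap,      # gap in traj_b
                dp[i][j - 1] + gap,      # gap in traj_a
            )
    return dp[n][m]


a = [(0, 0), (1, 0)]
b = [(0, 0), (1, 1)]
cost = global_alignment_distance(a, b)
```

With a gap penalty this large relative to typical point distances, equal-length segments are in effect compared point by point, which matches the stated intent of keeping sequences continuous.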


Bag Distance To get the bag distance score between two bags we use the
calculation |Bag_1 Δ Bag_2| / |Bag_1 ∪ Bag_2|.


Example Two example segments are given in Figure 10. 

The local alignment distance for these two segments is 20. The global alignment
distance for these two segments is 192.7. The overall delta distance and angle
score for these two segments is 61.66. The overall combined distance for these
two segments after weighting is 258.72.




Fig. 10. Two movement segments. Each coordinate in the trajectory is labelled with
its timestamp in the trajectory ∈ [0..9]. Both segments begin with a long sprint towards
one direction and then a sequence of small steps in the opposite direction (scales are
×10).



